<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	<title>控制词干提取 | Elasticsearch: 权威指南 | Elastic</title>
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
</head>
<body>
<div class="main-container">
    <section id="content">
        
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">原文地址: <a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/controlling-stemming.html" rel="nofollow">https://www.elastic.co/guide/cn/elasticsearch/guide/current/controlling-stemming.html</a>, 版权归 www.elastic.co 所有<br/>
                            英文版地址: <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-stemming.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-stemming.html</a>
                            </div>
                        <!-- start body -->
                  <div class="page_header">
<b>请注意:</b><br>本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch: 权威指南</a></span>
»
<span class="breadcrumb-link"><a href="languages.html">处理人类语言</a></span>
»
<span class="breadcrumb-link"><a href="stemming.html">将单词还原为词根</a></span>
»
<span class="breadcrumb-node">控制词干提取</span>
</div>
<div class="navheader">
<span class="prev">
<a href="choosing-a-stemmer.html">« 选择一个词干提取器</a>
</span>
<span class="next">
<a href="stemming-in-situ.html">原形词干提取 »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="controlling-stemming"></a>控制词干提取<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/230_Stemming/50_Controlling_stemming.asciidoc">edit</a>
</h2>
</div></div></div>
<p>开箱即用的词干提取方案永远也不可能完美。
尤其是算法提取器，他们可以愉快的将规则应用于任何他们遇到的词，包含那些你希望保持独立的词。
也许，在你的场景，保持独立的 <code class="literal">skies</code> 和 <code class="literal">skiing</code> 是重要的，你不希望把他们提取为 <code class="literal">ski</code> （正如 <code class="literal">english</code> 分析器那样）。</p>
<p>语汇单元过滤器 <a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-keyword-marker-tokenfilter.html" class="ulink" target="_top"><code class="literal">keyword_marker</code></a> 和
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-stemmer-override-tokenfilter.html" class="ulink" target="_top"><code class="literal">stemmer_override</code></a>
能让我们自定义词干提取过程。</p>
<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="preventing-stemming"></a>阻止词干提取<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/230_Stemming/50_Controlling_stemming.asciidoc">edit</a>
</h3>
</div></div></div>
<p>语言分析器（查看 <a class="xref" href="configuring-language-analyzers.html" title="配置语言分析器">配置语言分析器</a>）的参数 <a class="xref" href="configuring-language-analyzers.html#stem-exclusion"><code class="literal">stem_exclusion</code></a>
允许我们指定一个词语列表，让他们不被词干提取。</p>
<p>在内部，这些语言分析器使用
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-keyword-marker-tokenfilter.html" class="ulink" target="_top"><code class="literal">keyword_marker</code> 语汇单元过滤器</a>
来标记这些词语列表为 <em>keywords</em> ，用来阻止后续的词干提取过滤器来触碰这些词语。</p>
<p>例如，我们创建一个简单自定义分析器，使用
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-porterstem-tokenfilter.html" class="ulink" target="_top"><code class="literal">porter_stem</code></a>​ 语汇单元过滤器，同时阻止 <code class="literal">skies</code> 的词干提取：</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords": [ "skies" ] <a id="CO161-1"></a><i class="conum" data-value="1"></i>
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO161-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>参数 <code class="literal">keywords</code> 可以允许接收多个词语。</p>
</td>
</tr>
</table>
</div>
<p>使用 <code class="literal">analyze</code> API 来测试，可以看到词 <code class="literal">skies</code> 没有被提取：</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis <a id="CO162-1"></a><i class="conum" data-value="1"></i></pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO162-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>返回: <code class="literal">sky</code>, <code class="literal">skies</code>, <code class="literal">ski</code>, <code class="literal">ski</code></p>
</td>
</tr>
</table>
</div>
<div class="tip admon">
<div class="icon"></div>
<div class="admon_content">
<a id="keyword-path"></a>
<p>虽然语言分析器只允许我们通过参数 <code class="literal">stem_exclusion</code> 指定一个词语列表来排除词干提取，
不过 <code class="literal">keyword_marker</code> 语汇单元过滤器同样还接收一个 <code class="literal">keywords_path</code> 参数允许我们将所有的关键字存在一个文件。
这个文件应该是每行一个字，并且存在于集群的每个节点。查看 <a class="xref" href="using-stopwords.html#updating-stopwords" title="更新停用词（Updating Stopwords）">更新停用词（Updating Stopwords）</a> 了解更新这些文件的提示。</p>
</div>
</div>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="customizing-stemming"></a>自定义提取<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/230_Stemming/50_Controlling_stemming.asciidoc">edit</a>
</h3>
</div></div></div>
<p>在上面的例子中，我们阻止了 <code class="literal">skies</code> 被词干提取，但是也许我们希望他能被提干为 <code class="literal">sky</code> 。  The
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-stemmer-override-tokenfilter.html" class="ulink" target="_top"><code class="literal">stemmer_override</code></a> 语汇单元过滤器允许我们指定自定义的提取规则。
与此同时，我们可以处理一些不规则的形式，如：<code class="literal">mice</code> 提取为 <code class="literal">mouse</code> 和 <code class="literal">feet</code> 到 <code class="literal">foot</code> ：</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [ <a id="CO163-1"></a><i class="conum" data-value="1"></i>
            "skies=&gt;sky",
            "mice=&gt;mouse",
            "feet=&gt;foot"
          ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem", <a id="CO163-2"></a><i class="conum" data-value="2"></i>
            "porter_stem"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet <a id="CO163-3"></a><i class="conum" data-value="3"></i></pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO163-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>规则来自 <code class="literal">original=&gt;stem</code> 。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO163-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p><code class="literal">stemmer_override</code> 过滤器必须放置在词干提取器之前。</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO163-3"><i class="conum" data-value="3"></i></a></p>
</td>
<td align="left" valign="top">
<p>返回 <code class="literal">the</code>, <code class="literal">mouse</code>, <code class="literal">came</code>, <code class="literal">down</code>, <code class="literal">from</code>, <code class="literal">the</code>, <code class="literal">sky</code>,
<code class="literal">and</code>, <code class="literal">ran</code>, <code class="literal">over</code>, <code class="literal">my</code>, <code class="literal">foot</code> 。</p>
</td>
</tr>
</table>
</div>
<div class="tip admon">
<div class="icon"></div>
<div class="admon_content">
<p>正如 <code class="literal">keyword_marker</code> 语汇单元过滤器，规则可以被存放在一个文件中，通过参数 <code class="literal">rules_path</code> 来指定位置。</p>
</div>
</div>
</div>

</div>
<div class="navfooter">
<span class="prev">
<a href="choosing-a-stemmer.html">« 选择一个词干提取器</a>
</span>
<span class="next">
<a href="stemming-in-situ.html">原形词干提取 »</a>
</span>
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>